Language Technology for Normalisation of Less-Resourced Languages

Authors

  • Mikel L. Forcada
  • Guy De Pauw
  • Gilles-Maurice de Schryver
  • Kepa Sarasola
  • Francis M. Tyers
  • Peter Waiganjo Wagacha
Abstract

This paper describes the stages involved in implementing a corpus of spoken Irish. This pilot project, consisting of approximately 140K words of transcribed data, implements part of the design of a larger corpus of spoken Irish, which it is hoped will contain approximately 2 million words when complete. It is hoped that such a corpus will provide material for linguistic research, lexicography, the teaching of Irish, and the development of language technology for the Irish language.

Similar resources

Quizzes on Tap: Exporting a Test Generation System from One Less-Resourced Language to Another

It is difficult to develop and deploy Language Technology and applications for minority languages for many reasons. These include the lack of Natural Language Processing (NLP) resources for the language, a scarcity of NLP researchers who speak the language and the communication gap between teachers in the classroom and researchers working in universities and other centres of research. One appro...

Morphological analysis for less-resourced languages: Maximum Affix Overlap applied to Zulu

The paper describes a collaboration approach in progress for morphological analysis of less-resourced languages. The approach is based on firstly, a language-independent machine learning algorithm, Maximum Affix Overlap, that generates candidates for morphological decompositions from an initial set of language-specific training data; and secondly, language-dependent post-processing using langua...

Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR

This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...

Introduction to the special issue on processing under-resourced languages

The creation of language and acoustic resources, for any given spoken language, is typically a costly task. For example, a large amount of time and money is required to properly create annotated speech corpora for automatic speech recognition (ASR), domain-specific text corpora for language modeling (LM), etc. The development of speech technologies (ASR, Text-to-Speech) for the already highreso...

Basic Language Resources for Diverse Asian Languages: A Streamlined Approach for Resource Creation

The REFLEX-LCTL (Research on English and Foreign Language Exploitation - Less Commonly Taught Languages) program, sponsored by the United States government, was an effort in simultaneous creation of basic language resources and technologies for under-resourced languages, with the aim to enrich sparse areas in language technology resources and encourage new research. We were tasked to produce basic...

Publication date: 2012